Machine learning-based causal models for predicting the response of individual patients to dexamethasone treatment as prophylactic antiemetic

Risk-based strategies are widely used for decision making in the prophylaxis of postoperative nausea and vomiting (PONV), a major complication of general anesthesia. However, whether risk is associated with individual treatment effect remains uncertain. Here, we used machine learning-based algorithms for estimating the conditional average treatment effect (CATE) (double machine learning [DML], doubly robust [DR] learner, forest DML, and generalized random forest) to predict the treatment response heterogeneity of dexamethasone, the first choice for prophylactic antiemetics. Electronic health record data of 2026 adult patients who underwent general anesthesia from January to June 2020 were analyzed. The results indicated that only a small subset of patients respond to dexamethasone treatment, and many patients may be non-responders. Estimated CATE did not correlate with predicted risk, suggesting that risk may not be associated with individual treatment responses. The current study suggests that predicting treatment responders by CATE models may be more appropriate for clinical decision making than conventional risk-based strategy.

Conditional average treatment effect models. We predicted the individualized treatment response to dexamethasone for PONV prophylaxis using multiple CATE estimation algorithms (double machine learning [DML] 16 , doubly robust [DR] learner 17,18 , generalized random forests [GRF] 19 , and forest DML 20 ). The distributions of the estimated CATE, which reflects the treatment response, were skewed to negative values (Fig. 1), consistent with PONV risk reduction by dexamethasone 6 . In all models, CATE distribution had a peak near zero, corresponding to the non-responders. The estimated CATE values showed a strong correlation among the different models, suggesting the reproducibility of the estimates in different algorithms (Fig. 2).
For comparison, we also created two machine learning-based risk prediction models: (1) a model with covariates selected by stepwise selection (optimized risk model) and (2) a model with previously reported risk factors as covariates (base risk model). In terms of risk prediction, the performance of the optimized risk model (area under the receiver operating characteristic curve [AUROC] 0.714; 95% CI 0.683-0.746) was among the best of previously reported models 9 and outperformed the base risk model using previously reported risk factors as covariates (AUROC 0.635; 95% CI 0.602-0.668). The performance indicators of the risk prediction models are Table 1. Baseline patient characteristics of overall dataset, training/validation set, and test set a . PONV postoperative nausea and vomiting, COPD chronic obstructive pulmonary disease, PCI percutaneous coronary intervention, CABG coronary artery bypass graft, ASA-PS American Society of Anesthesiologists Physical Status, TIVA total intravenous anesthesia, ICU intensive care unit. a Data are expressed as No. (%) unless otherwise indicated.

Discussion
This retrospective cohort study identified the individuals likely to respond to dexamethasone treatment for PONV prophylaxis using CATE models, and only a small subset of patients may respond to the treatment. Furthermore, predicted risks were not associated with the estimated treatment responses. www.nature.com/scientificreports/ Although randomized control trials (RCTs) are considered the gold standard of evidence-based medicine, they cannot determine whether a treatment is beneficial to a specific individual. In practice, many clinical guidelines recommend risk-based patient selection for intervention 5,8 . Risk-stratified subgroup analyses in other clinical disciplines have suggested an association between risk and treatment effect [32][33][34] . However, a careful interpretation is required for subgroup analyses, because insufficient power in stratified samples can result in misleading results 35 .
In the current study, the results indicated that predicted risk may lack association with individual treatment responses. Prior studies in nephrology 22 and critical care 24 report that treatment effect heterogeneity models are superior to the risk prediction models in identifying the responder to the treatment, supporting the current results. Furthermore, our results indicate that many patients may be non-responders to dexamethasone treatment. Thus, direct prediction of treatment heterogeneity may be more efficient than conventional risk-based strategy.
Previously reported clinical applications of CATE models estimated the treatment effect from the difference in the predicted risk between the treatment groups in RCT data 22,24,36 . The methods applied in the current study were designed for observational data, with confounder adjustments intrinsic to the algorithm [16][17][18][19][20] . Modeling treatment effect heterogeneity in observational data has a significant advantage, enabling the use of large databases, such as electronic health records, that capture a broad and diverse population.
This study has several limitations. First, the accuracy of the CATE model cannot be evaluated directly because the ground truth is unobservable. Thus, we have evaluated the CATE models using surrogate measures of evaluation, including the reproducibility of the results among different CATE algorithms. However, we still cannot rule out the possibility of using biased measures for evaluation, which are discussed in further detail below. Second, we used observational data, and the estimates of individual treatment responses obtained by the CATE models can be biased if the identifying assumptions were violated. CATE models use the propensity score model to adjust for the confounders, and we have selected observable potential confounders for the covariates, including previously reported risk factors of PONV. The distribution range of the propensity score suggested that the positivity assumption holds. Furthermore, the timing of dexamethasone administration was always at anesthesia induction, and a previous meta-analysis reported that there was no significant difference in the treatment effect within the dosage used in this study (4-8 mg) 37 . Thus we assumed that Stable Unit Treatment Value Assumption (SUTVA) holds. Although we cannot completely rule out the influence of unobserved confounders, we consider that the identifying assumptions were reasonably satisfied. Third, results may not be generalizable to all patients because our data were derived from a single institution. We performed validation of our model by temporal splitting of the data 38 . Furthermore, our data include the coronavirus pandemic period, and the splitting point is in proximity to the first declaration of a state of emergency in Tokyo. Our results should be more generalizable than conventional temporal splitting, considering the environmental changes caused by coronavirus in the test set. Fourth, there may be selection bias because we excluded some samples, such as intubated patients, in which we could not evaluate the outcome.
The current results demonstrate a framework for identifying the responders to antiemetic dexamethasone treatment by applying machine learning-based causal models. Conventional risk prediction models may not be suitable for identifying a small subset of treatment responders, and the approach using CATE models may be a

Methods
Ethics. All data were extracted from institutional electronic health records after approval by the Ethics Committee of Teikyo University Hospital. All methods were carried out in accordance with the institutional guidelines and regulations. Informed consent was obtained from all the participants in the form of opt-out on the website. De-identified data were used for analysis.
Datasets. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline 39 . Analyses were based on a retrospective cohort of adult patients (age ≥ 18 years) who underwent general anesthesia at Teikyo University Hospital from January 2020 to June 2020. Extracted electronic health record data included demographics, routine preoperative evaluation, anesthesia records, and routine postoperative evaluation (Supplementary Table 3). Supplementary Fig. 6 shows the study flowchart.
Exclusion criteria were open-heart cardiac surgery and intubated or unconscious patients. Patients without extubation or unexpected events within 24 h of surgery (emergency re-operation, intubation, intensive care unit [ICU] admission, or patient escape) precluding outcome observation were regarded as censored data and were excluded. Patients discharged on postoperative day 1, within 24 h of surgery, were included in the analysis. No www.nature.com/scientificreports/ patient was discharged on postoperative day 0. Three patients with missing data (anesthesia duration, 0.15%) and 18 patients with dexamethasone treatment dose < 4 mg (0.89%) were also excluded from the analysis. Data were split temporally into the training/validation set (January-March 2020) and the test set (April-June 2020). A state of emergency was declared in Tokyo on April 7, 2020, due to the coronavirus pandemic. Less urgent elective surgery and postsurgical ICU admission were restricted, and the environment changed dramatically from infection control procedures in the test set. We chose temporal splitting to show the generalizability of the CATE models in temporally different datasets 38 . Random, non-temporal splitting of the datasets was used for sensitivity analysis. For cross-validation, the training/validation set was further randomly split into the training set and the validation set.
Primary outcome. Outcome Y of interest was nausea or vomiting within 24 h of surgery, assessed on a binary scale. The assessment was based on routine postoperative evaluation by anesthesiologists on postoperative day 1 and routine nurse evaluation in the postanesthesia care unit, ICU, or general ward.
Treatment. The treatment T considered was an intravenous bolus of 4-8 mg dexamethasone at anesthesia induction on a binary scale, which is the recommended dose in the guideline of PONV management 5 . A previous meta-analysis indicated no significant difference in the incidence of PONV between 4-5 mg and 8-10 mg dexamethasone treatment 37 .
Conditional average treatment effect models. CATE models use machine learning algorithms to estimate treatment effect, conditional on the combination of covariates reflecting patient characteristics 40 19 , and forest DML 20 ) and compared their performance with the risk prediction models. Let Y denote the outcome of interest, T denote the treatment, X denote the covariates characterizing the individuals, and W denote the observed confounders.
We used L2-regularized logistic regression for the nuisance parameter estimation, except for GRF, which is designed to use random forest. L2 regularization adds a penalty term weighted by the square of the coefficient to avoid overfitting. Ridge regression, which is an L2-regularized linear regression, was used for the final stage regression model in DML and DR learner. Doubly robust (DR) learner. This algorithm is a modified version of conventional doubly robust approach 42 , and estimates CATE θ(X) using the outcome prediction model and propensity score model as the nuisance parameters. Parameters conditional to each treatment level are defined for binary treatment T = t ∈ {0, 1}: potential outcome Y t , risk prediction model m t (X, W), propensity score model e t (X, W), and error coefficient γ t . The following models are assumed: www.nature.com/scientificreports/ Independent and identically distributed samples labeled i = 0, 1, …, n, each consisting of the following parameters are defined: the outcome Y i ∈ R , the treatment T i ∈ {0, 1}, the covariates X i ∈ R , and the observed confounders W i ∈ R . The following estimates of potential outcomes Y DR i,t are constructed for t = 0 and t = 1: CATE θ(X) is estimated by solving the regression model over a parameter target class : Generalized random forest (GRF). This algorithm estimates CATE from the local regression of moment equation 41 , using data-adapted weight obtained from the modified random forest with splitting criteria which maximizes heterogeneity. The confounders are adjusted by residualizing the outcome Y and the treatment T, using predicted outcome m(X, W) = E[Y | X, W] and propensity score e(X, W) = E[T | X, W] as the nuisance parameters. These parameters are predicted by conventional random forest. Subsequent steps are performed on residualized data instead of the original. Subsamples are chosen randomly from the sample without replacement, then split into equal size samples for the splitting phase and estimation phase. Such partitioning is called "honest" when the information used for splitting is never used for estimation. Honesty avoids overfitting and ensures statistical inference. In the splitting phase, the causal tree is grown by splitting the sample space, maximizing the heterogeneity of the estimated treatment effect between the child nodes. Numerical approximations of heterogeneity based on gradient tree algorithms are made to reduce computational costs. The terminal node in which each sample fall represents a cluster with similar propensity. In the estimation phase, samples are fitted to the causal tree to determine which terminal node it falls in. Data-adaptive weight α is calculated as a list of neighboring training samples, weighted by the frequency it fell in the same terminal node as the test sample. CATE θ is estimated by solving the weighted moment equation using this list of data-adaptive weight and the score function ψ: Forest double machine learning (DML). This algorithm estimates CATE using the moment equation of DML in the splitting phase of GRF. The original study 20 used local fitting of the nuisance parameters, but we used a modified version 40 with the global fitting of nuisance parameters to reduce computational costs.
Risk prediction models. We created two machine learning-based PONV risk prediction models to compare with CATE models: base risk model with previously reported risk factors of PONV 5,9 (age, anesthesia duration, sex, history of motion sickness or PONV, nonsmoker, postsurgical opioid infusion) selected as covariates and optimized risk model with covariates chosen by stepwise selection. The covariates are provided in Supplementary Table 3. L2-regularized logistic regression was used for both models. We performed fivefold crossvalidation, and the models with the highest area under the receiver operating characteristic curve (AUROC) were selected.

Model interpretation.
We used Shapley additive explanations (SHAP) to interpret the model estimation 27 .
SHAP is a game theory-based approach to explain the output of a machine learning model. We used SHAP to assess the contribution of each covariate in the CATE estimation.
Uplift curve evaluation. The accuracy of the CATE model cannot be evaluated directly because its true value is unobservable. Thus, we evaluated the models using the uplift curve, which is a popular metric for evaluating CATE models 30,43 . The samples were sorted by the rank of the estimated CATE values, and subsamples consisting of top k samples (k = 1, 2, …, n; n, total sample size) were created for each value of k. The uplift curve f(k) plots the estimated difference in PONV events between the treated and untreated groups, calculated from the observed outcome in each subsample: The baseline plots the expected uplift curve value when the subsamples consist of a random CATE: Pr[T = t|X, W] = e t (X, W). www.nature.com/scientificreports/ If the value of CATE is estimated correctly, a greater decrease in PONV events should be observed in the uplift curve compared to the baseline. For statistical analysis, the area under the uplift curve (AUUC) 30,31 was calculated as the cumulative difference between the baseline and the uplift curve values: A greater positive AUUC indicates better model performance in identifying the responder to the treatment. The AUUC of a null model is zero. The 95% CI of the AUUC was estimated from 2000 bootstrap resampling.
The uplift curve was originally designed for the evaluation of randomized data 30,43,44 . We modified the approach by separately adjusting for confounders in all subsamples constituting the uplift curve. The confounders were adjusted using inverse probability weighting (IPW): or a doubly robust estimator 42 : The propensity score e(W) and risk model conditional on treated m 1 (W) and untreated m 0 (W) were predicted by L2-regularized logistic regression, using observed confounders W as covariates 45 .
Model parameter selection. The samples were split into the training/validation set and the test set. All procedures for model parameter selection were performed in the training/validation set, which was further split into the training set and the validation set for evaluation. Let X denote the covariates characterizing the individuals and W denote the observed confounders. The observed confounders W were selected from previously reported risk factors 5,9 and expert opinions as fixed parameters. The parameters of CATE models, including covariates X, were selected by stepwise selection with threefold cross-validation, and those with the highest AUUC in the validation set were used. The parameters of risk prediction models were selected similarly, except fivefold cross-validation was used and AUROC was used for performance evaluation. Different fold cross-validation was used for CATE model and risk prediction model because of the difference in the required sample size for evaluation. Candidate and selected covariates are provided in Supplementary Table 3.

Sensitivity analysis.
We conducted four sensitivity analyses. First, we created a placebo treatment by post hoc assignment of 2000 random binary variables to ensure the lack of heterogeneity in the absence of treatment effects. Second, we evaluated the model performance in 2000 random non-temporal splitting of datasets. Third, we created samples excluding emergency surgery (n = 217; 10.7%) to ensure that the heterogeneity was not due to inadequate presurgical evaluation. Fourth, we evaluated the AUUC excluding sample proportion below 0.3 or 0.4 in the uplift curve to ensure that the result was not due to insufficient confounder adjustment in a small subsample size. Statistical analysis. All analyses were performed using Python version 3.8.9 and the following add-on libraries: Scikit-learn package version 0.24.2 for all machine learning models, EconML 40 version 0.12.0 for all CATE models, and SHAP 27 version 0.39.0 for model interpretation. A two-sided P value of < 0.05 was considered significant.

Data availability
The deidentified data are available from the corresponding author upon reasonable request. .